arXiv:1507.04553vl [stat.CO] 16Jul2015 


Approximate Maximum Likelihood Estimation 


Johanna Bertl* 

Aarhus University, Denmark 

Gregory Ewing 

Ecole polytechnique federale de Lausanne, Switzerland 

Carolin Kosiol Andreas Futschik 

Vetmeduni Vienna, Austria Johannes Kepler University Linz, Austria 

July 17, 2015 


Abstract 

In recent years, methods of approximate parameter estimation have at¬ 
tracted considerable interest in complex problems where exact likelihoods 
are hard to obtain. In their most basic form, Bayesian methods such as 
Approximate Bayesian Computation (ABC) involve sampling from the 
parameter space and keeping those parameters that produce data that 
fit sufficiently well to the actually observed data. Exploring the whole 
parameter space, however, makes this approach inefficient in high dimen¬ 
sional problems. This led to the proposal of more sophisticated iterative 
methods of inference such as particle filters. 

Here, we propose an alternative approach that is based on stochastic 
gradient methods and applicable both in a frequentist and a Bayesian 
setting. By moving along a simulated gradient, the algorithm produces a 
sequence of estimates that will eventually converge either to the maximum 
likelihood estimate or to the maximum of the posterior distribution, in 
each case under a set of observed summary statistics. To avoid reaching 
only a local maximum, we propose to run the algorithm from a set of 
random starting values. 

As good tuning of the algorithm is important, we explored several 
tuning strategies, and propose a set of guidelines that worked best in our 
simulations. We investigate the performance of our approach in simula¬ 
tion studies, and also apply the algorithm to two models with intractable 
likelihood functions. First, we present an application to inference in the 
context of queuing systems. We also re-analyze population genetic data 
and estimate parameters describing the demographic history of Sumatran 
and Bornean orang-utan populations. 

* Johanna.bertl@clin.au.dk 
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1 Introduction 


Both in the Bayesian as well as in the frequentist framework, statistical infer¬ 
ence commonly uses the likelihood function. However, under a wide range of 
complex models, no explicit formula is available. This occurs for example when 


modelling population genetic and evolutionary processes (Marjoram and Tavare 


2006), in spatial statistics (f 

loubeyrand et al. 

2009 

), and in queuing systems 

(Heggland and Frigessi 

200^ 

1). Furthermore, such situations also occur with 


dynamical systems as used for example in systems biology (Toni et al. 2009) 


and epidemiology (McKinley et ah, 2009). Here, we consider statistical models 
with an intractable distribution theory, but a known data generating process 
under which simulations can be obtained. 

A recent approach to overcome this problem is Approximate Bayesian Com¬ 
putation (ABC) (Beaumont et al. 2002). Its basis is a rejection algorithm; 
Parameter values are randomly drawn from the prior distribution, data sets are 
then simulated under these values. To reduce complexity, informative but low 
dimensional summaries are derived from the data sets. All parameter values 
that gave rise to summary statistics similar to those computed for the observed 
data are then accepted as a sample from the posterior distribution. Hence, ABC 
does not involve any kind of direct approximation of the likelihood, but with a 
uniform prior (with support on a compact subset of the parameter space taken 
large enough to contain the maximum likelihood estimator) it can be used to 
obtain a simulation-based approximation of the likelihood surface and, subse¬ 
quently, the maximum likelihood estimate for the observed summary statistic 
values (Creel and Kristensen 2013 Rubio and Johansen 2013). 

ABC has successfully been used in population genetic applications and also in 
other fields (Beaumont 2010). However, the sampling scheme can be inefficient 
in high-dimensional problems: The higher the dimension of the parameter space 
and the summary statistics, the lower is the acceptance probability of a simu¬ 
lated data set. Consequently, more data sets need to be simulated to achieve 
a good estimate of the posterior distribution. Alternatively, the acceptance 
threshold needs to be relaxed, but this increases the bias in the approxima¬ 
tion to the posterior distribution. To reduce the sampling space, combinations 
of ABC and iterative Monte Carlo methods such as particle filters and Markov 


Chain Monte Carlo have been proposed (Beaumont et al., 2009 Wegmann et al. 


2009). 


Here, we follow an alternative approach to obtain an approximate maximum 
likelihood estimate. Instead of using random samples from the whole param¬ 
eter space, we propose a stochastic gradient algorithm that approximates the 
maximum likelihood estimate. Similar to ABC, it relies on lower-dimensional 
summary statistics of the data. In the simultaneous perturbations algorithm, 
in each iteration two noisy evaluations of the likelihood are sufficient to obtain 
an ascent direction (Spall 1992). To this end the likelihood is approximated 
by kernel density estimation on summary statistics of data simulated under the 
parameter value of interest. 

Our algorithm is related to a method suggested in Diggle and Gratton (19841 
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(see also Fermanian and Salanie, 20041. There, an approximate maximum like¬ 
lihood estimate is obtained using a stochastic version of the Nelder-Mead algo¬ 
rithm. However, the authors explore only applications to 1-dimensional i.i.d. 
data. In principle, our approach can also be applied in the context of indirect 
inference, where summary statistics are derived from a tractable auxiliary model 


(Drovandi et al., 2011). 


2 Method 

We will start by describing the so called simultaneous perturbation algorithm 
for optimizing the expected value of a random function using simulations. Next, 
we explain how this algorithm can be adapted to obtain maximum likelihood 
estimates. Details on the tuning can be found in the following section. 


2.1 General simultaneous perturbation algorithm 

Let Y S K be a random variable depending on 0 G The function L(0) = 
E{Y I 0) shall be maximized in 0, but L(0) as well as the gradient VL(0) 
are unknown. If a realization ?/(0) of H | 0 can be observed for any value of 
0, the maximizer argmaxegRp L(0) can be approximated by the simultaneous 
perturbation algorithm (Spall [T992 ). 

Similar to a deterministic gradient algorithm, it is based on the recursion 


0fc = 0fc-i + afcVL(0fc_i), 

where Uk G R’*’ is a decreasing sequence. However, as VL(0fe_i) is unknown, it 
is substituted by a rescaled putative ascent direction that is randomly chosen 
among the vertices of the unit hypercube. Thus, in iteration fc, a random vector 
6k with elements 


„(z) _ J —1 with probability 1/2, , , 

^ 1+1 with probability 1/2, ^ ' 

for z = 1,... ,p is generated, and the ascent direction is approximated by 

j 2/ (0fe + CkSk) - y (0fc - Cfe4) 

- ' 

with Ck G R"*" being a decreasing sequence. 

2.2 Approximate maximum likelihood algorithm 

Suppose, data Dobs are observed under model M with unknown parameter 
vector 0 G R^. Let L{Q]Dobs) = piDobs \ 0) denote the likelihood of 0. For 
complex models often there is no closed form expression for the likelihood, and 
for high dimensional data sets, the likelihood can be difficult to estimate. As in 
ABC, we therefore consider L{Q]Sobs) = p{Sobs \ 0), an approximation to the 
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original likelihood that uses a d-dimensional vector of summary statistics Sobs 
instead of the original data Dobs- 

An estimate L{Q;Sobs) of L{Q]Sobs) can be obtained from simulations un¬ 
der Ad(0) using kernel density estimation. By setting y(0) = L{Q]Sobs), we 
adopt the simultaneous perturbation algorithm to approximate the maximum 
likelihood estimate 0ml of 0. In more detail, the algorithm is as follows: 


Algorithm (AML). Let ak,Ck € be two decreasing sequences. Let Hk be 
a sequence of symmetric positive definite d x d matrices and k a d-dimensional 
kernel function satisfying K{x)dx = 1. Choose a starting value 0o G 
For fc = 1, 2,..., K: 

1. Choice of a likely ascent direction in 0fc-i- 

(a) Cenerate a random vector 5k as defined in equation Q. 

(b) Simulate datasets from and from 

AI(0^) with 0^ = 0fc_i ± Cfedfc. 

(c) Compute summary statistics S~ on dataset DJ and S^ on for 

j = 

(d) Estimate the likelihood L{Q~; Sobs) = p{Sobs I 0~) and L{Q^; Sobs) = 

p{Sobs I 0’’’) from the summary statistics S^ ,..., S~ and S^ ,..., 5”+, 
respectively, with multivariate kernel density estimation (Wand and 
Jones\ 199^ : 


HQ--Sobs) = 


n^/dei(H^ ^ 


n 

(^o»s-S-)) 


and analogously for L{Q'^, Sobs)- 

(e) Estimate the ascent direction \Jl{Qk-i] Sobs) by 


^Cfc^(0fc—1? Sobs) — 5k 


logF(0+; Fobs) - log £(0 ^Sobs) 

2cfc 


2. Updating Qk-' 

0fc 0/i; — 1 “b O-k^Jo^l{Qk—l: Sobs) 

Then, the approximate maximum likelihood (AML) estimate is 0aml := 
Qk- 

If, in a Bayesian setting, a prior distribution 7r(0) has been specihed, the 
algorithm can be modihed to approximate the maximum of the posterior dis¬ 
tribution, ©map, by multiplying L{Q~,Sobs) and L{Q^, Sobs) by 7r(0“) and 
7r(0+), respectively. 
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2.3 Parametric bootstrap 

Confidence intervals and estimates of the bias and standard error of ©aml can 
be obtained by parametric bootstrap: B bootstrap datasets are simulated from 
the model At (©aml) and the AML algorithm is run on each dataset to obtain 
the bootstrap estimates ©aml n ■•■’®amlb- This sample reflects both the 
error of the maximum likelihood estimator as well as the approximation error. 

We compute simple bootstrap confidence intervals that are based on the as¬ 
sumption that the distribution of ©aml — © can be approximated sufficiently 
well by the distribution of ©aml ~ ©amL; where ©aml bootstrap esti¬ 

mator. Then, a two-sided (1 — a)-conhdence interval is defined as 


2©AML - 9(l-a/2)(©AML).2©AML “ 9(a/2)(©AML) 


where (Z(/3)(0 amt,) denotes the /3 quantile of ©amt. i > • • • > ©amt, b (Davison and 
Hinkleylp^ . 


3 Tuning guidelines 

The performance of the algorithm strongly depends on the choice of the se¬ 
quences Ofe and Cfc. However, their optimal values depend on the unknown like¬ 
lihood function. Another challenge in the practical application is the stochas- 
ticity: large steps can by chance lead very far away from the maximum. 


3.1 Choice of Uk and Ck and convergence diagnostics 

Here, we consider sequences of the form 




as proposed in Spall (2003). To reflect the varying slope of L in different direc¬ 


tions of the parameter space, we choose a G (this is equivalent to scaling the 
space accordingly). Choosing a = 1 and 7 = 1/6 ensures optimal convergence 
speed (Spall 2003). The optimal choice of a and c depends on the unknown 


shape of L. Therefore, we propose the following heuristic based on suggestions 
in 

A 


Spall (2003) and our own experience to determine these values as well as 


G N, a shift parameter to avoid too fast decay of the step size in the first 
iterations. 


Let K he the number of planned iterations and b G a vector that gives the 
desired stepsize in early iterations in each dimension. Choose a starting value 
© 0 . 

1. Set c to a small percentage of the total parameter range (usually between 
1% and 5%) to obtain ci. 

2. SetA= [0.1* ATJ. 
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3. Choose a: 


(a) Estimate VI{Qq] Sobs) by the median of ni finite differences approx¬ 
imations (step 1 of the AML algorithm), ciI{Qq, Sobs)■ 


(b) Set 


(i) {A +1)“ . 

^ for 1 = 1 ,..., p. 


(^^oJ{eo;Sobs)) 


As a is determined using information about the likelihood in 0o only, it 
might not be adequate in other regions of the parameter space. To be able to 
distinguish convergence from a too small step size, we simultaneously monitor 
the growth of the likelihood function and the trend in the parameter estimates to 
adjust a if necessary. Every Kq iterations the following three tests are conducted 
on the preceding Kq iterates: 


• Trend test (too small a): For each dimension i = 1,... ,p a trend in 

(0 

0^.' is tested using the standard random walk model 

= ®fc-i + /3 + efc, 

where /3 denotes the trend and Ck N(0,a^). The null hypothesis (3 = 0 
can be tested by a t-test on the differences = 0^,*^ — 0^1^. If a trend 
is detected, is increased by a fixed factor f S K"*". 

• Range test (too large a): For each dimension i = l,...,p, is set 
to // if the trajectory of 0^,*^ spans more than 70% of the parameter 
range. 


• Convergence test: Simulate n 2 likelihood estimates at Qk-Ko at 
0fc. Growth of the likelihood is then tested by a one-sided Welch’s t-test. 
(A standard hypothesis test; testing for equivalence could be used instead, 
see 


Wellek (2010).) 


We conclude that the algorithm has converged to a maximum only if the 
convergence test did not reject the null hypothesis three times in a row and at 
the same time no adjustments of a were necessary. In the applications of the 
algorithm presented in the article, / = 1.5. 


3.2 Kernel density estimation 

To enable the estimation of the gradient even far away from the maximum, 
a kernel function with infinite support is helpful. The Gaussian kernel is an 
obvious option, but the high rate of decay can by chance cause very large steps 
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leading away from the maximum. Therefore, we use the following modification 
of the Gaussian kernel: 


cx 


exp(— ^x) lix'H < 1 
exp ( — i Vx'H~^x ] otherwise. 


In degenerate cases where the likelihood evaluates to zero numerically, we 
replace the classical kernel density estimate by a nearest neighbor estimate in 
step Id: 

If L{Q~ ] Sobs) ~ 0 or/and L{Q^; Sobs) ~ 0 (with denoting “numerically 
equivalent to”), find 


Q- 

:= argminj S'- 
S- 

Sobs 11 


. ,u 

<7+ 

^min 

:= argmin | \\Sf 

Sobs 11 

: J = 1, • ■ 

. ,u 


I - 
J 


and recompute the kernel density estimate L(0 ; s) and L{Q~^;s) in step Id 
using S'~i„ and respectively, as the only observation. 


To be computationally efficient, we estimate a diagonal bandwidth matrix 


Hk using a multivariate extension of Silverman’s rule (Silverman 1986 Hardle 


et al. 2004). Using new estimates in each iteration introduces an additional 


level of noise that can be reduced by using a moving average of bandwidth 
estimates. 


3.3 Starting points 

To avoid starting in very unfavourable regions of the likelihood, a set of random 
points should be drawn from the parameter space and their simulated likelihood 
values compared to find useful starting points. This strategy also helps to avoid 
that the algorithm reaches only a local maximum. 


3.4 Constraints 


The parameter space will usually be subject to constraints (e.g., rates are pos¬ 
itive quantities). They can be incorporated by projecting the iterat e to the 
closest point such that both 0“ as well as 0+ are in the feasible set (Sadegh 
19971 ). Even if there are no imperative constraints, it is advisable to restrict the 


parameter space to a range of plausible values to prevent the algorithm from 
trailing off at random in regions of very low likelihood. 

To reduce the effect of large steps within the boundaries, we clamp the step 
size at 10% of the range of the feasible set in each dimension. 
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4 Examples 

To study the performance of the AML algorithm, we test it on different ap¬ 
plications. The first example is the multivariate normal distribution. While 
there is no need for simulation based inference for normal models, it allows us 
to compare the properties of the AML estimator and the maximum likelihood 
estimator. Second, we apply the algorithm to a queuing process where the exact 
evaluation of the likelihood can require a prohibitive number of steps. The third 
example is from population genetics. Here, we use both simulated data, where 
the true parameter values are known, as well as DNA sequence data from a 
sample of Bornean and Sumatran orang-utans and estimate parameters of their 
ancestral history. 

4.1 Multivariate normal distribution 

When the AML algorithm is used to estimate the mean vector 0 = {fii ,..., /rio) 
of a 10-dimensional normal distribution with a diagonal VC matrix using the 
arithmetic means X = (Xi,...,Aio) as summary statistics, convergence is 
achieved quickly and the results are extremely accurate. 

One dataset is simulated under a 10-dimensional normal distribution such 
that the maximum likelihood estimator for 0 is 0ml = V J\f{5 ■ lio,lio) 
where Iio denotes the 10-dimensional identity matrix and lio the 10-dimensional 
vector with 1 in each component. To estimate the distribution of 0aml, the 
AML algorithm is run 1000 times on this dataset with summary statistics S = 
X. 

For each AML estimate, 1000 points are drawn randomly on (—100,100) x 
... X (—100,100). For each of them, the likelihood is simulated and the 5 points 
with the highest likelihood estimate are used as starting points. On each of 
them, the AML algorithm is run for at least 10000 iterations and stopped as soon 
as convergence is reached (for « 90% of the sequences within 11000 iterations; 
convergence is tested every Kq = 1000 iterations). Again, the likelihood is 
simulated on each result and the one with the highest likelihood is considered 
as a realization of 0 aml- For each evaluation of the likelihood, n = 100 datasets 
are simulated. Based on these 1000 realizations of 0 aml, the density, bias and 
standard error of 0 aml are estimated for each dimension (tab. 0 fig-i- 

The densities of the 10 components of ©aml are symmetric around the max¬ 
imum likelihood estimate with a standard error that is nearly 20 times smaller 
than the standard error of 0 ml and a negligible bias. Bootstrap confidence 
intervals that were obtained from B = 100 bootstrap samples for 100 datasets 
simulated under the same model as above show a close approximation to the 
intended coverage probability of 95%. 

To investigate the impact of the choice of summary statistics on the results, 
we repeat the experiment with the following set of summary statistics: 


(2) 


Xi X2 + X3 X4 + X5 _ X7 _ _ X9_ 

X 2 — X^ + Xq X'j + Xg Xg • Xio 

X 6 + X 4 

For « 90% of the sequences, convergence was detected within 14000 iter¬ 
ations. Bootstrap confidence intervals are obtained for 100 datasets. Their 
coverage probability matches the nominal 95% confidence level closely (tab.[^. 
The components behave very similarly to the previous simulations, except for 
the estimates of the density for components 9 and 10 (fig. [^. Compared to 
the simulations with S = X, the bias of components 9 and 10 is considerably 
increased, but it is still much smaller than the standard error of ©ml- To inves¬ 
tigate how fast the bias decreases with the number of iterations, we re-run the 
above described algorithm for 100000 iterations without earlier stopping (fig.[^. 
Both bias and standard error decrease with the number of iterations. 

Table 1: Properties of ©aml in the 10-dimensional normal distribution model 
using 2 different sets of summary statistics 



S 

= X 


S as in eq. 

2 


dim. 

b 

se 

P 

b 

se 

P 

1 

-0.0016 

0.0546 

93 

-0.0007 

0.0660 

93 

2 

0.0017 

0.0544 

94 

-0.0002 

0.0638 

93 

3 

0.0004 

0.0547 

94 

0.0007 

0.0635 

97 

4 

-0.0033 

0.0567 

90 

-0.0032 

0.0789 

98 

5 

-0.0015 

0.0584 

95 

-0.0000 

0.0748 

90 

6 

-0.0000 

0.0565 

94 

0.0044 

0.0757 

90 

7 

0.0017 

0.0557 

96 

0.0035 

0.0686 

95 

8 

-0.0007 

0.0559 

95 

-0.0030 

0.1140 

96 

9 

-0.0013 

0.0554 

91 

0.0809 

0.0658 

98 

10 

0.0001 

0.0554 

99 

-0.0595 

0.0922 

92 


Bias (6) and standard error (se) of ©aml, estimated from 1000 runs of the AML 
algorithm. Coverage probability of bootstrap 95% confidence intervals (p) with 
B = 100, estimated from 100 datasets. 


4.2 Queueing process 


The theory of queueing processes is a very prominent and well-studied field 
of applied probability with applications in many different areas like produc¬ 
tion management, communication and information technology, and health care 
(Kingman 2009 Gross et al. 2008). Here, we consider a first-come-first-serve 


queuing process with a single server, a Markovian distribution of interarrival 
times of customers and a general distribution of service times (M/G/1). It 
is modelled by the following sequences of random variables: The service time 
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Figure 1: Density of the components of ©aml obtained with S' = X in one 
dataset estimated from 1000 converged sequences with a miminum length of 
10000 iterations by kernel density estimation. Vertical dashed line: ©ml- 




Figure 2: Density of the components of ©aml obtained with S as in eq. ([^ in 
one dataset estimated from 1000 converged sequences with a miminum length 
of 10000 iterations by kernel density estimation. Vertical dashed line: ©ml- 


































































Figure 3: Absolute bias and standard error of the AML estimator of fig and 
fiio using the summary statistics in eq. ([^ estimated from 1000 runs of the 
algorithm on the same dataset. 


Um, the interarrival time W, 


and the interdeparture time Ym for customer 
m = ( Heggland and Frigessi[ 2004). Um and Wm are independent 

and their distributions are known except for the parameters. Here we assume 
Um ~ U ([ 6 * 1,01 + 02 ]) and Wm ~ exp( 03 ). Ym is defined as 


Y — 

J- m. — 


m m— 1 

Um if E < E Y,, 

m m— 1 

Um +^Wi— Yi otherwise 

i=l i=l 


for m = 1,..., M. 

If only the interdeparture times Ym are observed, the evaluation of the likeli¬ 
hood of the parameter vector 0 = ( 0 i, 02, 03) requires an exponential number of 
steps in M ( Heggland and Frigessi||2004 ). Approximate inference has been con¬ 
ducted with indirect inference (Heggland and Frigessif 2004) and ABC methods 


(Blum and Frangois 2010). 


To estimate the distribution of ©aml in this model, we simulate 100 datasets 
of size M = 100 under the same parameter value 0 = (1,4,0.2) and compute 
©AML for each of them. The summary statistics are the minimum, maximum 
and the quartiles of Fi, ■ ■ • ,Fm ( Blum and Frangois) 2010). To estimate the 
standard error of ©aml per dataset, we re-run the algorithm five times on each 
dataset. 

Each time, we randomly draw 100 starting values from the search space 
(0,10) X (0,10) X (0.05,10) and run the AML algorithm on the best five starting 
values for 5000 iterations (runtime of an implementation in R, version 3.0.1 
(2013|): 1.9 hours on a single core). Finally, we estimate the 


R Core Team 
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likelihood on the 5 results and the best one is used as a realization of 0aml- 
Each likelihood estimation is based on n = 100 simulations. For 3 datasets, 
the AML algorithm is run 100 times to obtain kernel density estimates of the 
density of ©aml- 


Table 2: Properties of ©aml in tbe M/G/1 queue 


01 



T-^-1-T 

0.2 0.4 0.6 0.8 1.0 1.2 


02 




0.15 0.20 0.25 0.30 


true 

1 

space 

(0,10) 

mean 

0.942 

median 

0.948 

bias 

-0.058 

mean se 

0.062 

total se 

0.098 

true 

5 

space 

(0,10) 

mean 

5.031 

median 

4.891 

bias 

0.031 

mean se 

0.437 

total se 

1.046 

true 

0.2 

space 

(0.05,10) 

mean 

0.229 

median 

0.226 

bias 

0.029 

mean se 

0.004 

total se 

0.024 


Figures: marginal densities of components of ©aml, estimated by kernel density 
estimation. Black solid line: density of ©aml estimated over 100 simulated data 
sets. Coloured dotted and dashed lines: density of ©aml estimated separately 
in 3 example datasets. Vertical black line: true parameter value. 

Summaries: true: true parameter value; space: search space; mean: mean of 
©aml; median: median of ©aml; bias: bias of ©aml; mean se: mean standard 
error of ©aml per dataset; total se: standard error of ©aml- 


In all three dimensions, the standard error of ©aml is considerably larger 
across datasets than within dataset (tab. [^. This indicates that the approxi¬ 
mation algorithm only adds a small additional error to the error of ©ml- In all 
dimensions the density of ©aml is slightly asymmetric around the true value, 
but only the bias of 63 is of the same order as the standard error. It is proba- 

[Mo| 


bly caused by a strongly skewed likelihood function (Blum and Frangois 


that emphasises the error of the hnite differences approximation to the slope. 
Additional simulations of the likelihood on a grid between 63 and 63 locate the 
maximum around 0.209. Re-running the algorithm for K = 10000 iterations 
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reduces the bias from 0.029 to 0.024. The parameter estimate with the largest 
standard error is 62 which is also the most difficult parameter to estimate with 
ABC (Blum and Frangois 20101. 


4.3 Population genetics 


The goal of population genetics is to understand how evolutionary forces like 
mutation, recombination, selection and genetic drift shape the variation within 
a population. A major challenge in the application of population genetic theory 
is to infer parameters of the evolutionary history of a population from DNA se¬ 
quences of present-day samples only. The questions range from estimating single 
parameters like the mutation rate or the recombination rate to determining a 
complex demographic structure with multiple parameters. 

Since the introduction of the coalescent by Kingman in the early eighties 
(Kingman | l982a|b|cD a very elegant and flexible stochastic process is at hand 


to model the ancestral history of a population. The coalescent can capture pop¬ 
ulation structure, varying population size and different mating schemes. Exten¬ 


sions to incorporate recombination and selection exist as well (Nordborg 2007 


Wakeley 2008). 


However, statistical inference on data obtained from a coalescent process is 
difficult: Under the coalescent, computing the likelihood of a parameter vector is 
a computationally very expensive task even for a single locus, since all possible 
ancestral trees need to be taken into account (Stephens, 2007). The number 


of binary tree topologies explodes with the number of sampled individuals, n, 
e. g. with n = 10, the number of possible topolo gies is 2,571,912,000, with 
n = 1000, it is already 3.01 • 10^®^^ (Wakeley 2008 p. 83, table 3.2). 


Consequently, exact computation of the likelihood is feasible in reasonable 
time for a very restricted set of models and a small number of haplotypes only 
(Griffiths and Tavare 1994 Wu ,2010|. To make use of the large DNA data sets 


generated by modern sequencing technology, approximate methods have been 
developed in frequentist as well as in Bayesian frameworks. In fact, the need 
for inferential tools that allow for large genomic data sets was a major driver 
in the development of ABC. The focus of the early ABC papers was clearly on 


1999 Beaumont et al. 


population genetic applications (Weiss and von Haeseler 1998, Pritchard et al. 


2002 ) and since then it has found a large variety of 
applications in the field of population genetics ( Csillery et al.|[?010 Beaumont 
2010 Bertorelle et al., 2010). 


4.3.1 Population genetic history of orang-utans 

Pongo pygmaeus and Pongo abelii, Bornean and Sumatran orang-utans, respec¬ 
tively, are Asian great apes whose distributions are exclusive to the islands 
of Borneo and Sumatra. Recurring glacial periods that led to a cooler, drier 
and more seasonal climate might have contracted rain forest and isolated the 
population of orang-utans. At the same time, the sea level dropped and land 
bridges among islands created opportunities for migration among previously 
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isolated populations. However, whether glacial periods have been an isolating 
or a connecting factor remains poorly understood. Therefore, there has been a 
considerable interest in using genetic data to understand the demographic his¬ 
tory despite the computational difficulties involved in such a population genetic 
analysis. We will compare our results to the analysis of the orang-utan genome 


paper (Locke et al. 20111 and a more comprehensive study by Ma et al. (Ma 


et al. 20131; the only analyses that have been performed genome-wide. Both 


studies use DaDi (Gutenkunst et al. 2009), a state-of-art software that has been 


widely used for demographic inference. It is based on the diffusion approxima¬ 
tion to the coalescent. The orang-utan SNP data consisting of 4-fold degenerate 
(synonymous) sites are taken from the 10 sequenced individuals (5 individuals 


each per orang-utan population, haploid sample size 10, Locke et al. 
As in 


Locke et al. (2011) and Ma et al. (2013), we consider an Isolation- 


2011 ). 


Migration (IM) model where a panmictic ancestral population of effective size 
splits r years ago into two distinct populations of constant effective size Ng 
(the Bornean population) and Ns (the Sumatran population) with backward 
migration rates fiBS (fraction of the Bornean population that is replaced by 
Sumatran migrants per generation) and fisB (vice versa; fig. [4^. 



deme 1: 


deme 2 


(a) 


(b) 


Figure 4: (a) Isolation-migration model for the ancestral history of orang¬ 

utans. Na, effective size of the ancestral population; fiBS (msb)j fraction of 
the Bornean (Sumatran) population that is replaced by Sumatran (Bornean) 
migrants per generation (backwards migration rate); r, split time in years; Nb 
(N s), effective population size in Borneo (Sumatra). 


(b) Binned joint site frequency spectrum (adapted from Naduvilezhath et al. 

2onl). 


Na is set to the present effective population size that we obtain using the 
number of SNPs in our considered data set and assuming an average per gener- 
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ation mutation rate per nucleotide of 2 • 10 ® and a generation time of 20 years 


(Locke et al. 20111, so Na = 17400. 


There are no sufficient summary statistics at hand, but for the IM model the 
joint site frequency spectrum (JSFS) between the two populations was reported 
to be a particularly informative summary statistic (Tellier et al. 2011). How¬ 


ever, for N samples in each of the two demes, the JSFS has {N + 1)^ — 2 entries, 
so even for small datasets it is very high-dimensional. To reduce this to more 
manageable levels we follow Naduvilezhath et al. (2011) and bin categories of 
entries (fig. 4b). As the ancestral state is unknown, we use the folded binned 
JSFS that has 28 entries. To incorporate observations of multiple unlinked loci 
mean and standard deviation across loci are computed for each bin, so the final 
summary statistics vector is of length 56. 

The AML algorithm with the JSFS as summary statistics was implemented 


as an extra module in the msms package, a coalescent simulation program (Ew¬ 


ing and Hermisson, 2010). This allows for complex demographic models and 


permits fast summary statistic evaluation without using external programs and 
the associated performance penalties. 


4.3.2 Simulations 

Before applying the AML algorithm to the actual orang-utan DNA sequences, 
we tested it on simulated data. Under the described IM model with parameters 
Nb = Ns = 17400, /iss = /isB = 1.44 • 10“^ and r = 695000, we simulated 25 
datasets with 25 haploid sequences per deme, each of them consisting of 75 loci 
with 130 SNPs each. 

For each dataset, 25 AML estimates were obtained with the same scheme: 
1000 random starting points were drawn from the parameter space; the like¬ 
lihood was estimated with n = 40 simulations. Then, the 5 values with the 
highest likelihood estimates were used as starting points for the AML algo¬ 
rithm. The algorithm converged after 3000-25000 iterations (average: « 8000 
iterations; average runtime of msms: 11.7 hours on a single core). 

For the migration rates ^lsb and ^bs, the per dataset variation of the es¬ 
timates is considerably smaller than the total variation (tab. [^. This suggests 
that the approximation error of the AML algorithm is small in comparison to 
the error of ©ml- For the split time r and the population sizes Ns and A^b, 
the difference is less pronounced, but still apparent. For all parameters, the 
average bias of the estimates is either smaller or of approximately the same 
size as the standard error. As the maximum likelihood estimate itself cannot be 
computed, it is impossible to disentangle the bias of ©ml and an additional bias 
introduced by the AML algorithm. As an alternative measure of performance, 
we compare the likelihood estimated at the true parameter values and the AML 
estimate with the highest estimated likelihood on each dataset. Only in 2 of 
the 25 datasets, the likelihood at the true value is higher, but not significantly 
so, whereas it is significantly lower in 12 of them (significance was tested with a 
one-sided Welch test using a Bonferroni-correction to account for multiple test¬ 
ing). This suggests that the AML algorithm usually produces estimates that 
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are closer to the maximum likelihood estimate than the true parameter value 
is. 

To investigate the impact of the underlying parameter value on the quality 
of the estimates, simulation results were obtained also for 25 datasets simulated 
with T twice as large. Here, all parameter estimates, especially r, Ng and 
Ns had larger standard errors and biases (tab. |^. Apparently, the estimation 
problem is more difficult for more distant split times. This can be caused by a 
flatter likelihood surface and by stronger random noise in the simulations. Only 
the migration rates are hardly affected by the large r: a longer divergence time 
allows for more migration events that might facilitate their analysis. 


4.3.3 Real data 


We model the ancestral history of orang-utans with the same model and use 
the same summary statistics as tested in the simulations. Also the datasets are 
simulated in the same manner. 

To study the distribution of 0aml on this dataset, the algorithm is run 
20 times with the same scheme as in the simulations, so we can estimate the 
standard error of ©aml in this single dataset. The best one is used as the final 
parameter estimate ©aml and is used for bootstrapping. 

Confidence intervals are obtained by parametric bootstrap with B = 1000 
bootstrap datasets. The bootstrap replicates are also used for bias correction 
and estimation of the standard error (tab. [^. 


In Locke et al. (2011) and Ma et al. (2013), parameters of two different IM 


models are estimated, denote the estimates ©i and © 2 . Scaled to the ancestral 
population size Na = 17372, the estimates are shown in tab.|^ 

Model 1 is identical to the model considered here, so we simulate the likeli¬ 
hood at ©1 within our framework for comparison. Since logL(0i) = —217.015 
(se = 7.739) is significantly lower than logL(0AML) = —162.732 (se = 7.258), 
it seems that ©aml is closer to the maximum likelihood estimate than the com¬ 
peting estimate. Note, however, that we are only using a subset of the data to 
avoid sites under selection and that the authors report convergence problems of 
DaDi in this model. 

For model 2, the ancestral population splits in two subpopulations of size 
s and 1 — s relative to the ancestral population and the subpopulations expe¬ 
rience exponential growth. Here a direct likelihood comparison with the DaDi 


estimates given in Locke et al. (2011) is impossible. However, a rough compar¬ 
ison shows that the AML estimates for r, Ng and Ng lie between ©i and ©2 
and for ^gg and ^gg they are of similar size. 


4.3.4 Orang-utan data set 

This real data is based on data sets of two publications, |De Maio et ^ (2013) and 
Ma et al. (2013). For the first, CCDS alignments of H. sapiens, P. troglodytes 


and P. abelii (references hgl8, panTro2 and ponAbe2) were downloaded from 
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Table 3: Properties of 0aml in the IM model with short divergence time (r = 
695000 years) 
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Figures and summaries: As in Tab. 
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Table 4: Properties of ©aml in the IM model with long divergence time (r = 
1390000 years) 
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Figures and summaries: As in Tab. 


18 























Table 5: Parameter estimates for the ancestral history of orang-utans 



fJ'BS 


T 

Ns 

Nb 

©AML 

5.023e-06 

4.600e-06 

1300681 

52998 

21971 

A* 

'^AML 

4.277e-06 

3.806e-06 

1402715 
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se 

1.244e-06 
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208391 
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se 
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0 1 
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715590 
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13290 

upper 

6.627e-06 

4.931e-06 

1820852 

67118 

27426 


^This confidence interval was shifted to the positive range. The original value was 

(-9.09e-07,5.72e-06). 


©AML, approximate maximum likelihood estimate; ©aml; bootstrap bias cor¬ 
rected estimate; se*, bootstrap standard error of ©aml; standard error of 
©AML in this dataset, estimated from 20 replicates of ©aml; lower and upper 
limits of the 95% simultaneous bootstrap confidence intervals. All bootstrap 
results were obtained with B = 1000 bootstrap replicates. The simultaneous 
95% confidence intervals are computed following a simple Bonferroni argument, 


so the coverage probabilities are 99% in each dimension (Davison and Hinkley 


19971. 


Table 6: Results from 

Locke et al. 

2011 Tab. S21-1) (same results for Model 

2 reported in 

Ma et al. 

(2013 

l), scaled with N^, = 17400. 



t^BS 

fJ'SB 

T 

Ns 

Nb 
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9.085e-07 

7.853e-07 

6948778 

129889 

50934 

Model 2 

1.518e-05 

2.269e-05 

630931 

35976 

10093 

©AML 

5.023e-06 

4.600e-06 

1300681 

52998 

21971 


population splits in two subpopulations with a ratio of s = 0.503 (estimated) 
going to Borneo and 1 — s to Sumatra and exponential growth in both 


subpopulations (Locke et al. 2011 Fig. S21-3). Here, Nb and Ns are the 
present population sizes. 
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the UCSC genome browser (http://genome.ucsc.edu). Only CCDS align¬ 
ments satisfying the following requirements were retained for the subsequent 
analyses: divergence from human reference below 10%, no gene duplication in 
any species, start and stop codons conserved, no frame-shifting gaps, no gap 
longer than 30 bases, no nonsense codon, no gene shorter than 21 bases, no 
gene with different number of exons in different species, or genes in different 
chromosomes in different species (chromosomes 2a and 2b in non-humans were 
identified with human chromosome 2). From the remaining CCDSs (9,695 genes, 

79,677 exons) we extracted synonymous sites. We only considered third codon 
positions where the first two nucleotides of the same codon were conserved in 
the alignment, as well as the first position of the next codon. 

Furthermore, orang-utan SNP data for the two (Bornean and Sumatran) 
populations considered, each with 5 sequenced individuals |Locke et al.| (|2011|), 
were kindly provided by X. Ma and are available online (http: //www. ncbi . nlm. 
nih.gov/proj ects/SNP/snp_viewTable.cgi?type=contact&handle=WUGSC_SNP&batch_ 
id=1054968). The final total number of synonymous sites included was 1,950,006. 

Among them, a subset of 9750 4-fold degenerate synonymous sites that are poly¬ 
morphic in the orang-utan populations were selected. 


5 Discussion 


In this article, we propose an algorithm to approximate the maximum likelihood 
estimator in models with an intractable likelihood. Simulations and kernel den¬ 
sity estimates of the likelihood are used to obtain the ascent directions in a 
stochastic approximation algorithm, so it is flexibly applicable to a wide variety 
of models. 

Alternative simulation based approximate maximum likelihood methods have 
been proposed that estimate the likelihood surface from samples from the whole 


parameter space that are obtained in an ABC like fashion (Creel and Kristensen 


2013 Rubio and Johansen 2013) or using MCMC (de Valpine 2004). The 


maximum likelihood estimator is obtained subsequently by standard numerical 
optimization. Conversely, our method only uses simulations and estimates of 
the likelihood along the trajectory of the stochastic approximation algorithm, 
thereby approximating the maximum likelihood estimator more efficiently. This 


is along the lines of Diggle and Gratton (19841 where a stochastic version of the 


Nelder-Mead algorithm is used, but through the use of summary statistics we 
can drop the restrictive i.i.d. assumption. In the setting of hidden Markov 
models, Ehrlich et al. (2013) have proposed a recursive maximum likelihood 


algorithm that also combines ABC methodology with the simultaneous pertur¬ 
bations algorithm. A thorough comparison of the properties of different ap¬ 
proximate maximum likelihood methods is beyond the scope of this paper, but 
the three presented examples show that the algorithm provides fast and reliable 
approximations to the corresponding maximum likelihood estimators. 

The population genetics application where we estimate parameters of the 
evolutionary history of orang-utans demonstrates that very high-dimensional 
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summary statistics (here: 56 dimensions) can be used successfully without any 
dimension-reduction techniques. Usually, high-dimensional kernel density esti¬ 


mation is not recommended because of the curse of dimensionality (e.g., Wand 


and Jones, 1995), but stochastic approximation algorithms are explicitly de¬ 


signed to cope with noisy measurements. To this end, we also introduce mod¬ 
ifications of the algorithm that reduce the impact of single noisy likelihood es¬ 
timates. In our experience, this is crucial in settings with a low signal-to-noise 
ratio. 

Furthermore, the examples show that the AML algorithm performs well in 
problems with a high-dimensional and large parameter space: In the normal 
distribution example, the 10-dimensional maximum likelihood estimate is ap¬ 
proximated very precisely even though the search space spans 200 times the 
standard error of ©ml in each dimension. 

However, we also observe a bias for a few of the estimated parameters. 
Partly, for example for one of the parameters of the queuing process, this can 
be attributed to a bias of the maximum likelihood estimator itself. In addition, it 
is known that the finite differences approximation to the gradient in the Kiefer- 


Wolfowitz algorithm causes a bias that vanishes only asymptotically (Spall 


2003), and that is possibly increased by the finite-sample bias of the kernel 


density estimator. In most cases though, the bias is smaller than the standard 
error of the approximate maximum likelihood estimator and can be made still 
smaller by carrying out longer runs of the algorithm. 

As sufficiently informative summary statistics are crucial for the reliability of 
both maximum likelihood and Bayesian estimates, the quality of the estimates 
obtained from our AML algorithm will also depend on an appropriate choice 
of summary statistics. This has been discussed extensively in the context of 
ABC (Fearnhead and Prangle, 2012 Blum et al. 2013). General results and 


algorithms to choose a small set of informative summary statistics should carry 
over to the AML algorithm. 

In addition to the point estimate, we suggest to obtain conhdence intervals 
by parametric bootstrap. The bootstrap replicates can also be used for bias cor¬ 
rection. Resampling in models where the data have a complex internal structure 
catches both the noise of the maximum likelihood estimator as well as the ap¬ 
proximation error. Alternatively, the AML algorithm may also complement the 
information obtained via ABC in a Bayesian framework: the location of the 
maximum a posteriori estimate can be obtained from the AML algorithm. 

The presented work shows the broad applicability of the approximate max¬ 
imum likelihood algorithm and also its robustness in highly stochastic settings. 


With the implementation in the coalescent simulation program msms (Ewing and 


Hermisson 2010) its potential for population genetic applications can easily be 


explored further. 
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